1. Introduction 


A number of surveys have been conducted around airports 
to study the relationship between the level of exposure to 
aircraft noise experienced by people living in the area and 
their annoyance with it. A two-stage sample is commonly 
adopted, selecting a sample of clusters at the first-stage 
and then a sample of individuals within selected clusters at 
the second stage. Most airports have maps of noise contours 
which are often used for stratification at the first stage; 
generally a disproportionate stratified sample of clusters 
is drawn, oversampling those in the high noise areas. Often 
all individuals in a selected cluster are assumed to 
experience the same noise exposure, which may therefore be 
measured by a single set of physical measurements in each 
sampled cluster. 

In the simplest case, the regression coefficient for 
the simple regression of annoyance (y) on noise level (x) is 
the quantity of interest. Frequently annoyance is regressed 
on several noise-related independent variables, in which 
case the ratio of regression coefficients is often of 
interest (as with the noise and number index, NNI). The 
issues addressed in this report are (1) the method of 
calculating standard errors for the estimated regression 
coefficients and for the ratio of estimated regression 
coefficients with a clustered two- or three-stage sample 
design and (2) the optimum way of allocating the sample 
across the stages of the sample design. 

2. Regression Model 

One approach to the specification of the regression is 
to take the regression coefficient in the population sampled 
as the quantity of interest. This population regression 
coefficient is 


$$ B = \sum_i (X_i - \bar{X})(Y_i - \bar{Y}) \Big/ \sum_i (X_i - \bar{X})^2 $$

for the population of size N. Under this approach, B may be 
estimated by 


$$ b = \sum_i w_i (x_i - \bar{x})(y_i - \bar{y}) \Big/ \sum_i w_i (x_i - \bar{x})^2 $$

where $\bar{x} = \sum_i w_i x_i / \sum_i w_i$, $\bar{y} = \sum_i w_i y_i / \sum_i w_i$ and the $w_i$ are weights 
inversely proportional to individuals' selection 
probabilities. Then the standard error of b may be 
estimated by techniques such as balanced repeated 
replication or jackknife repeated replication (Kish and 
Frankel, 1970, 1974). These techniques can take full 
account of the disproportionate stratification and 
clustering in the sample design. 
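The weighted estimator of the finite population coefficient can be sketched in a few lines. This is only an illustration: the function name `weighted_slope` and the data values are hypothetical, and the weights are simply taken as the reciprocals of assumed selection probabilities.

```python
def weighted_slope(x, y, w):
    """Weighted regression slope with weights w_i (e.g. inverse selection probabilities)."""
    total_w = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    num = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    den = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    return num / den


# Hypothetical data: noise level x, annoyance score y, selection probabilities p.
x = [60.0, 65.0, 70.0, 75.0, 80.0]
y = [2.0, 3.0, 3.0, 5.0, 6.0]
p = [0.01, 0.01, 0.02, 0.04, 0.04]
w = [1.0 / pi for pi in p]

b_weighted = weighted_slope(x, y, w)
b_unweighted = weighted_slope(x, y, [1.0] * len(x))
print(b_weighted, b_unweighted)
```

Only the relative sizes of the weights matter, since any common scale factor cancels between numerator and denominator. The standard error of this estimator would still need a replication method of the kind cited above; the sketch does not attempt that.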

The attraction of treating the quantity of interest as 
a parameter of the finite population (B) is the avoidance of 
the model assumptions required for standard regression 
analysis. However, the consequence of not making such 
assumptions is that the sample estimator b estimates B only 
for the specific population sampled, and cannot be readily 
applied to other populations. For the problem under study, 
the aim is to estimate a more general parameter, applicable 
to a wide range of populations (i.e. populations around a 
range of existing and proposed airports). For this reason, 
some regression model seems essential. 

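The assumptions made with the standard linear 
regression model $y_i = \beta_0 + \beta x_i + e_i$ are that $E(e_i) = 0$, 
$V(e_i) = \sigma^2$ and $\mathrm{Cov}(e_i, e_k) = 0$ for $i \neq k$. Under these 
assumptions $\beta$ may be estimated by 

$$ b = \sum_i (x_i - \bar{x})(y_i - \bar{y}) \Big/ \sum_i (x_i - \bar{x})^2 \tag{1} $$

with $\bar{x} = \sum_i x_i/n$ and $\bar{y} = \sum_i y_i/n$. The variance of $b$ is 

$$ V(b) = \sigma^2 \Big/ \sum_i (x_i - \bar{x})^2 \tag{2} $$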

With this model the x's are considered fixed by the design. 
The choice of x-values affects the magnitude of V(b), but 
the above formulae apply whatever values of x are chosen. 
From the sampling perspective, the x's are mainly determined 
by the disproportionate stratification, and the formulae 
automatically reflect this aspect of the sample design. To 
the extent that the sampled x's are not fixed by the design, 
the formulae may be treated as conditional on the x's 
obtained. 

While the standard regression model readily 
accommodates the effect of disproportionate stratification 
by $x$, it does not suitably reflect the clustering in the 
sample design. The clusters used in sample designs almost 
always exhibit some degree of homogeneity with respect to 
the variables under study, and this homogeneity has also 
been found to occur with regression residuals. The 
consequence of this homogeneity is that the assumption 
$\mathrm{Cov}(e_i, e_k) = 0$ does not hold for individuals $i$ and $k$ in the 
same cluster. To handle this feature, the model may be 
extended to 

$$ y_{ij} = \beta_0 + \beta x_i + \alpha_i + e_{ij} $$

where the subscripts $(i,j)$ refer to individual $j$ in cluster 
$i$, and $\alpha_i$ is the cluster effect of cluster $i$. The $\alpha_i$ are 
random effects with $E(\alpha_i) = 0$. Under the further assumption 
$E(\alpha_i \mid x) = 0$, or $\mathrm{Cov}(\alpha_i, x) = 0$, $b$ in (1) remains unbiased for 
$\beta$, but equation (2) no longer holds for the variance of $b$. 
It should be noted that estimators of $\beta$ that are more 
efficient than b are available for this model; however, for 
simplicity, we will consider only the simple estimator b. 

3. Variance of b 

With the double subscript notation the estimated 
regression coefficient $b$ in (1) may be expressed as 

$$ b = \sum_i \sum_j (x_i - \bar{x})(y_{ij} - \bar{y}) \Big/ \sum_i n_i (x_i - \bar{x})^2 $$

$$ = \sum_i \sum_j (x_i - \bar{x})\, y_{ij} \Big/ \sum_i n_i (x_i - \bar{x})^2 $$

$$ = \sum_i n_i (x_i - \bar{x})\, \bar{y}_i \Big/ \sum_i n_i (x_i - \bar{x})^2 $$

where there are $n_i$ sampled individuals in cluster $i$ and 
$\bar{y}_i = \sum_j y_{ij}/n_i$. 

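Conditional on the $x_i$'s, the variance of $b$ is then 

$$ V(b) = \sum_i n_i^2 (x_i - \bar{x})^2\, V(\bar{y}_i) \Big/ \Big[\sum_i n_i (x_i - \bar{x})^2\Big]^2 $$

Under the model $y_{ij} = \beta_0 + \beta x_i + \alpha_i + e_{ij}$ with 
$V(e_{ij}) = \sigma_e^2$ and $V(\alpha_i) = \sigma_\alpha^2$, 

$$ V(\bar{y}_i) = V(\alpha_i + \bar{e}_i) = \sigma_\alpha^2 + (\sigma_e^2/n_i) $$

Thus 

$$ V(b) = \sum_i n_i^2 (x_i - \bar{x})^2 (\sigma_\alpha^2 + \sigma_e^2/n_i) \Big/ \Big[\sum_i n_i (x_i - \bar{x})^2\Big]^2 $$

$$ = \Big\{ \sigma_\alpha^2 \sum_i n_i^2 (x_i - \bar{x})^2 \Big/ \Big[\sum_i n_i (x_i - \bar{x})^2\Big]^2 \Big\} + \Big\{ \sigma_e^2 \Big/ \sum_i n_i (x_i - \bar{x})^2 \Big\} \tag{3} $$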

In the special case when the same subsample size is taken 
from each cluster, $n_i = \bar{n}$, $V(b)$ reduces to 

$$ V(b) = [\sigma_\alpha^2 + (\sigma_e^2/\bar{n})] \Big/ \sum_i (x_i - \bar{x})^2 \tag{4} $$

$$ = [(\sigma_\alpha^2/a) + (\sigma_e^2/n)]/\sigma_x^2 \tag{5} $$

where $\sigma_x^2$ is defined as $\sum_i (x_i - \bar{x})^2/a$, $a$ is the number of 
sampled clusters and $n = a\bar{n}$ is the total sample size. 

Defining the intra-class correlation coefficient for 
the clusters as the proportion of the variance of the $y_{ij}$ 
conditional on the $x_i$ that is accounted for by the cluster 
effect, i.e. $\rho = \sigma_\alpha^2/(\sigma_\alpha^2 + \sigma_e^2)$, $V(b)$ may be alternatively 
expressed as 

$$ V(b) = (\sigma^2/n)[1 + (\bar{n} - 1)\rho]/\sigma_x^2 \tag{6} $$

where $\sigma^2 = \sigma_\alpha^2 + \sigma_e^2$. 

An estimator of $V(b)$ may be obtained by substituting 
estimates $\hat{\sigma}_\alpha^2$ and $\hat{\sigma}_e^2$ for $\sigma_\alpha^2$ and $\sigma_e^2$ in (3) or (4). The 
quantity $\sigma_e^2$ may be estimated by the residual mean square 
from a one-way analysis of variance of the y-values by 
clusters, that is by 

$$ \hat{\sigma}_e^2 = \sum_i \sum_j (y_{ij} - \bar{y}_i)^2/(n - a) $$

where $n$ is the total sample size and $a$ is the number of 
sampled clusters. Then $\sigma_\alpha^2$ may be estimated by 

$$ \hat{\sigma}_\alpha^2 = \Big[\sum_i \sum_j (y_{ij} - b_0 - b x_i)^2 - (n - 2)\hat{\sigma}_e^2\Big] \Big/ [\lambda(a - 2)] $$

where $\sum_i \sum_j (y_{ij} - b_0 - b x_i)^2$ is the residual sum of squares 
from the regression of $y$ on $x$, $b_0$ is the sample estimate of 
the intercept and $\lambda = (n^2 - \sum_i n_i^2)/n(a - 2)$ (for $\lambda$, see 
Anderson and Bancroft, 1952, Section 25.2; Snedecor and 
Cochran, 1980, Section 13.7). 

4. Optimum subsample size, $\bar{n}$ 

In this section, we consider the optimum allocation of 
the sample between the first and second stages of the 
sample. We assume that the same subsample size $\bar{n}$ is taken 
from each selected cluster; the results obtained can also be 
applied as an approximation to situations where the 
subsample size varies to a small extent between clusters, in 
which case $\bar{n}$ represents the average subsample size. We 
assume a simple cost model of the form $C = aC_a + nc$, where 
$C_a$ is the cost of including a cluster in the sample, $c$ is 
the cost of including an individual, and $n = a\bar{n}$ is the total 
sample size. 

For given $\sigma_x^2$, the optimum choice of $\bar{n}$ that minimizes 
$V(b)$ for fixed total cost $C$ may then be readily obtained 
from the Cauchy-Schwarz inequality as follows. Write 
$V(b) \propto \sum x_i^2$, where $x_1 = \sigma_\alpha/\sqrt{a}$ and $x_2 = \sigma_e/\sqrt{n}$, and $C = \sum y_i^2$, 
where $y_1 = \sqrt{aC_a}$ and $y_2 = \sqrt{nc}$. Then the product $V(b) \cdot C$ is 
minimized when 

$$ (x_1/y_1) = (x_2/y_2) $$

i.e. when 

$$ \sigma_\alpha/(a\sqrt{C_a}) = \sigma_e/(n\sqrt{c}) $$

or 

$$ \bar{n}_{opt} = (\sigma_e/\sigma_\alpha)(C_a/c)^{1/2} \tag{7} $$

This result can be equivalently expressed in terms of the 
cluster intra-class correlation as 

$$ \bar{n}_{opt} = [(1 - \rho)/\rho]^{1/2}[C_a/c]^{1/2} \tag{8} $$

5. Example 

A study of traffic noise was carried out with a sample 
of n = 2933 cases in a = 53 clusters (Langdon, 1976). Of 
the 2933 cases, 2881 provided responses which are analyzed 
here. The average number of respondents per cluster is thus 
$\bar{n}$ = 54.358; the cluster sizes varied markedly, from the 
lowest of 20 respondents to the highest of 109 respondents. 
The dependent variable for the regression is the answer to 
the question "How do you feel about traffic noise here?" 
(the end points of the scale are labelled "definitely 
satisfactory" and "definitely unsatisfactory") and the 
independent variable is the noise level (24 hour Leq dB(A)). 
The regression coefficient is b = 0.07971. 

The following sums of squares (SS) and degrees of 
freedom (d.f.) were obtained for the regression of annoyance 
on noise level: 

    Source         d.f.            SS
    Regression        1      293.1871
    Residuals      2879    10200.1250
    Total          2880    10493.3121

The analysis of variance of the annoyance scores by 
clusters yielded the following results: 

    Source         d.f.            SS
    Clusters         52     1525.0925
    Residuals      2828     8968.2196
    Total          2880    10493.3121


From these results the following analysis of variance table 
for the regression residuals is constructed: 

    Residuals                     d.f.           SS          MS    E(MS)
    Between clusters
      after regression             51*    1231.9054    24.15501    $\sigma_e^2 + \lambda\sigma_\alpha^2$
    Within clusters               2828    8968.2196     3.17122    $\sigma_e^2$
    Total regression residuals    2879   10200.1250

    * Note that one degree of freedom is used for the regression. 


The residual variance $\sigma_e^2$ is estimated by the within clusters 
residual mean square, i.e. $\hat{\sigma}_e^2 = 3.17122$. The expected value 
of the between clusters after regression residual mean 
square is $\sigma_e^2 + \lambda\sigma_\alpha^2$, where $\lambda = (n^2 - \sum n_i^2)/nd$ and $d$ is the 
degrees of freedom for the between clusters after regression 
residuals. An approximate value for $\lambda$ is the average sample 
size per cluster, $\bar{n} = 54.358$. With $n = 2881$, $\sum n_i^2 = 174{,}571$ 
and $d = 51$, the exact value of $\lambda$ is 55.302. Using this 
exact value, 

$$ \hat{\sigma}_\alpha^2 = (24.15501 - 3.17122)/55.302 = 0.37944 $$

and 

$$ \hat{\rho} = \hat{\sigma}_\alpha^2/(\hat{\sigma}_\alpha^2 + \hat{\sigma}_e^2) = 0.1069 \text{ or } 10.7\%. $$
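The arithmetic of this step can be retraced with a short script. The sketch below is illustrative only: the function name `variance_components` is hypothetical, and the inputs are the sample size, sum of squared cluster sizes, degrees of freedom and mean squares quoted in the text.

```python
def variance_components(n, sum_ni_sq, d, ms_between, ms_within):
    """Return (lam, sigma2_alpha, rho) from the residual analysis of variance.

    lam          multiplier of the cluster variance in the expected between-cluster
                 mean square, (n^2 - sum n_i^2) / (n d)
    sigma2_alpha estimated between-cluster variance component
    rho          estimated intra-class correlation
    """
    lam = (n ** 2 - sum_ni_sq) / (n * d)
    sigma2_alpha = (ms_between - ms_within) / lam
    rho = sigma2_alpha / (sigma2_alpha + ms_within)
    return lam, sigma2_alpha, rho


lam, sigma2_alpha, rho = variance_components(
    n=2881, sum_ni_sq=174571, d=51, ms_between=24.15501, ms_within=3.17122
)
print(round(lam, 3), round(sigma2_alpha, 5), round(rho, 4))
```

Replacing the exact multiplier by the average cluster size 54.358 changes the variance component only slightly, which is why the text describes the average as an approximate value.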


With $\rho = 0.1069$, from (8) 

$$ \bar{n}_{opt} = 2.89\,[C_a/c]^{1/2} $$

Values of $\bar{n}_{opt}$ for various ratios of $C_a/c$ are given below: 

    C_a/c     5    10    20    30    40    50
    n_opt     6     9    13    16    18    20
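The tabulated optimum subsample sizes follow directly from equation (8). A minimal sketch, assuming the intra-class correlation 0.1069 estimated above (the function name `optimum_subsample_size` is hypothetical):

```python
import math


def optimum_subsample_size(rho, cost_ratio):
    """Equation (8): sqrt((1 - rho) / rho) * sqrt(cluster cost / individual cost)."""
    return math.sqrt((1.0 - rho) / rho) * math.sqrt(cost_ratio)


cost_ratios = (5, 10, 20, 30, 40, 50)
rounded = [round(optimum_subsample_size(0.1069, r)) for r in cost_ratios]
for ratio, n_opt in zip(cost_ratios, rounded):
    print(ratio, n_opt)
```

Because the optimum grows only with the square root of the cost ratio, a tenfold increase in the relative cost of a cluster raises the optimum subsample size by a factor of about three.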


A variance estimate for $b$ is obtained by substituting 
sample estimates in (3). Using 

$$ \sum n_i^2 (x_i - \bar{x})^2 = \sum n_i^2 x_i^2 - 2\bar{x} \sum n_i^2 x_i + \bar{x}^2 \sum n_i^2, $$

$$ \sum n_i (x_i - \bar{x})^2 = (n - 1)\hat{\sigma}_x^2 $$

where $\bar{x} = 70.5917$, $\hat{\sigma}_x = 4.0026$ is the standard deviation of 
$x$, $\sum n_i^2 x_i^2 = 869{,}779{,}520.5$, $\sum n_i^2 x_i = 12{,}303{,}941.25$ and 
$\sum n_i^2 = 174{,}571$, the following results are obtained: 

$$ \sum n_i^2 (x_i - \bar{x})^2 = 2587392.84 $$

$$ \sum n_i (x_i - \bar{x})^2 = 46139.92346. $$

Substituting these values and $\hat{\sigma}_\alpha^2$ and $\hat{\sigma}_e^2$ from above in (3) 
gives 

$$ v(b) = (46.11601 + 6.87305) \times 10^{-5} = 5.2989 \times 10^{-4}. $$

The estimate of the variance of $b$ from the standard 
regression analysis is $7.6788 \times 10^{-5}$, so that ignoring the 
cluster design underestimates the variance by a factor of 
6.90. This factor corresponds approximately to the 
multiplier $[1 + (\bar{n} - 1)\rho] = 6.70$ in (6). 

Note that an approximate variance estimate for $b$ is 
obtained by assuming $n_i = \bar{n}$ and using equation (5). Then 
$v^*(b) = 5.1558 \times 10^{-4}$. This value is fairly close to that 
obtained above, even in this case where the $n_i$ are subject 
to substantial variation. This approximate variance 
estimate is 6.70 times as large as the estimate of the 
variance of $b$ from standard regression analysis: this factor 
is the multiplier $[1 + (\bar{n} - 1)\rho] = 6.70$. 
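The two variance estimates in this example can be recomputed from the quantities quoted in the text. The sketch below is an illustration under that assumption; the function names are hypothetical, `var_b_clustered` implements the two terms of equation (3), and `var_b_equal_sizes` implements equation (5).

```python
def var_b_clustered(s2_alpha, s2_e, sum_n2_dx2, sum_n_dx2):
    """Equation (3): return the between-cluster and within-cluster terms of v(b)."""
    between = s2_alpha * sum_n2_dx2 / sum_n_dx2 ** 2
    within = s2_e / sum_n_dx2
    return between, within


def var_b_equal_sizes(s2_alpha, s2_e, a, n, var_x):
    """Equation (5): approximation that treats all cluster sizes as equal."""
    return (s2_alpha / a + s2_e / n) / var_x


s2_alpha, s2_e = 0.37944, 3.17122
between, within = var_b_clustered(s2_alpha, s2_e, 2587392.84, 46139.92346)
v_b = between + within

# Variance from the standard regression analysis, as quoted in the text.
v_standard = 7.6788e-5
inflation = v_b / v_standard

v_star = var_b_equal_sizes(s2_alpha, s2_e, a=53, n=2881, var_x=4.0026 ** 2)
print(v_b, inflation, v_star)
```

The between-cluster term dominates here: it is roughly seven times the within-cluster term, which is why the standard formula understates the variance so badly.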

6. Extension to regression with two independent variables 

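We turn now to a linear regression of $y$ on two 
independent variables $x$ and $z$, both of which are constant 
within clusters: 

$$ y_i = \beta_0 + \beta_x x_i + \beta_z z_i + e_i $$

Under the standard assumptions that $E(e_i) = 0$, $V(e_i) = \sigma^2$ 
and $\mathrm{Cov}(e_i, e_k) = 0$ for $i \neq k$, $\beta_x$ and $\beta_z$ may be estimated by 

$$ b_x = \Big\{\sum (z_i - \bar{z})^2 \sum (x_i - \bar{x})(y_i - \bar{y}) - \sum (x_i - \bar{x})(z_i - \bar{z}) \sum (z_i - \bar{z})(y_i - \bar{y})\Big\} \Big/ \Delta \tag{9} $$

$$ b_z = \Big\{\sum (x_i - \bar{x})^2 \sum (z_i - \bar{z})(y_i - \bar{y}) - \sum (x_i - \bar{x})(z_i - \bar{z}) \sum (x_i - \bar{x})(y_i - \bar{y})\Big\} \Big/ \Delta \tag{10} $$

where $\Delta = \sum (x_i - \bar{x})^2 \sum (z_i - \bar{z})^2 - [\sum (x_i - \bar{x})(z_i - \bar{z})]^2$. 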

Under this model the x's and z's are considered fixed by the 

design. The choice of combinations of x and z values 

affects the precision of the estimators. 

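Consider the estimators $b_x$ and $b_z$ under the model 

$$ y_{ij} = \beta_0 + \beta_x x_i + \beta_z z_i + \alpha_i + e_{ij} \tag{11} $$

where $\alpha_i$ is the cluster effect of cluster $i$, which is 
assumed to be a random effect with $E(\alpha_i) = 0$. Under the 
further assumptions that the $\alpha_i$ are uncorrelated with the 
$x$'s and the $z$'s, $b_x$ and $b_z$ remain unbiased for $\beta_x$ and $\beta_z$. 
Since $b_x$ and $b_z$ are of the same form, simply with $x$ and $z$ 
interchanged, it will suffice to obtain the variance of one 
of them, say $b_x$. Using the double subscript notation and 
letting the sum of squares of the $z$'s be 

$$ \sum_i \sum_j (z_i - \bar{z})^2 = \sum_i n_i (z_i - \bar{z})^2 = S_{zz} $$

and the sum of cross-products of the $z$'s and the $x$'s be 

$$ \sum_i \sum_j (x_i - \bar{x})(z_i - \bar{z}) = \sum_i n_i (x_i - \bar{x})(z_i - \bar{z}) = S_{xz}, $$

$b_x$ may be expressed as 

$$ b_x = \Big[S_{zz} \sum_i \sum_j (x_i - \bar{x})\, y_{ij} - S_{xz} \sum_i \sum_j (z_i - \bar{z})\, y_{ij}\Big] \Big/ \Delta $$

$$ = \sum_i \sum_j [S_{zz}(x_i - \bar{x}) - S_{xz}(z_i - \bar{z})]\, y_{ij} \Big/ \Delta $$

$$ = \sum_i n_i [S_{zz}(x_i - \bar{x}) - S_{xz}(z_i - \bar{z})]\, \bar{y}_i \Big/ \Delta $$

$$ = \sum_i n_i C_i \bar{y}_i \Big/ \Delta $$

where $C_i = S_{zz}(x_i - \bar{x}) - S_{xz}(z_i - \bar{z})$ and $\Delta$ is the denominator of (9) and (10). 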

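Conditional on the $x$'s and $z$'s, the variance of $b_x$ is 
then 

$$ V(b_x) = \sum_i n_i^2 C_i^2\, V(\bar{y}_i)/\Delta^2 $$

Under the model given by (11) with $V(e_{ij}) = \sigma_e^2$ and 
$V(\alpha_i) = \sigma_\alpha^2$, 

$$ V(\bar{y}_i) = V(\alpha_i + \bar{e}_i) = \sigma_\alpha^2 + (\sigma_e^2/n_i) $$

Thus 

$$ V(b_x) = \Big[\sigma_\alpha^2 \sum_i n_i^2 C_i^2 + \sigma_e^2 \sum_i n_i C_i^2\Big] \Big/ \Delta^2 \tag{12} $$

In the special case when $n_i = \bar{n}$, $V(b_x)$ reduces to 

$$ V(b_x) = \bar{n}^2 \sum_i C_i^2\, [\sigma_\alpha^2 + (\sigma_e^2/\bar{n})]/\Delta^2 \tag{13} $$

Defining the intra-class correlation coefficient for the 
clusters as $\rho = \sigma_\alpha^2/\sigma^2$, where $\sigma^2 = \sigma_\alpha^2 + \sigma_e^2$, $V(b_x)$ may be 
expressed as 

$$ V(b_x) = \bar{n} \sum_i C_i^2\, \sigma^2 [1 + (\bar{n} - 1)\rho]/\Delta^2. \tag{14} $$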

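In order to obtain the optimum subsample size, $\bar{n}$, it is 
useful to express $\sum C_i^2$ and $\Delta^2$ in terms of the variances of $x$ 
and $z$ and the covariance between $x$ and $z$, which are defined 
as 

$$ \sigma_x^2 = \sum (x_i - \bar{x})^2/a, \quad \sigma_z^2 = \sum (z_i - \bar{z})^2/a, \quad \sigma_{xz} = \sum (x_i - \bar{x})(z_i - \bar{z})/a. $$

Using this notation, $S_{zz} = \bar{n} a \sigma_z^2$, $S_{xz} = \bar{n} a \sigma_{xz}$, 

$$ \sum C_i^2 = \bar{n}^2 a^2 \sum [\sigma_z^2 (x_i - \bar{x}) - \sigma_{xz}(z_i - \bar{z})]^2 $$

$$ = \bar{n}^2 a^3 [\sigma_z^4 \sigma_x^2 - 2\sigma_z^2 \sigma_{xz}^2 + \sigma_{xz}^2 \sigma_z^2] $$

$$ = \bar{n}^2 a^3 \sigma_z^2 (\sigma_x^2 \sigma_z^2 - \sigma_{xz}^2) $$

and $\Delta^2 = \bar{n}^4 a^4 (\sigma_x^2 \sigma_z^2 - \sigma_{xz}^2)^2$. 

Substituting these values in (13) gives 

$$ V(b_x) = A[(\sigma_\alpha^2/a) + (\sigma_e^2/n)] \tag{15} $$

where $A = \sigma_z^2/(\sigma_x^2 \sigma_z^2 - \sigma_{xz}^2)$. 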

The form of $V(b_x)$ in (15) is now the same as that for the 
regression with the single independent variable in (5), with 
$A$ replacing $1/\sigma_x^2$. It therefore follows that the optimum value 
of $\bar{n}$ is given by equations (7) or (8), namely 

$$ \bar{n}_{opt} = [(1 - \rho)/\rho]^{1/2}[C_a/c]^{1/2} \tag{16} $$

Note, however, that is now the residual variance from the 
multiple regression. This residual variance can in general 
be expected to be smaller than that for the simple 
regression, and hence the value of $\bar{n}_{opt}$ will also be 
smaller. Since the formula for $\bar{n}_{opt}$ depends only on $\rho$ and 
the cost ratio, it is the same for both regression 
coefficients, i.e. the optimum allocation is given by (16) 
whether $\beta_x$ or $\beta_z$ is being estimated. 

In order to estimate the variance of $b_x$, first note 
that $\sum n_i C_i^2 = \sum n_i (z_i - \bar{z})^2\, \Delta$, so that $V(b_x)$ in (12) may be 
written as 

$$ V(b_x) = \Big(\sigma_\alpha^2 \sum n_i^2 C_i^2/\Delta^2\Big) + \Big(\sigma_e^2 \sum n_i (z_i - \bar{z})^2/\Delta\Big) \tag{17} $$

where $\sum n_i^2 C_i^2$ and $\Delta$ may be computed as 

$$ \sum n_i^2 C_i^2 = \Big[\sum n_i (z_i - \bar{z})^2\Big]^2 \sum n_i^2 (x_i - \bar{x})^2 $$

$$ \quad + \Big[\sum n_i (x_i - \bar{x})(z_i - \bar{z})\Big]^2 \sum n_i^2 (z_i - \bar{z})^2 $$

$$ \quad - 2 \sum n_i (z_i - \bar{z})^2 \sum n_i (x_i - \bar{x})(z_i - \bar{z}) \sum n_i^2 (x_i - \bar{x})(z_i - \bar{z}) $$

$$ \Delta = \sum n_i (x_i - \bar{x})^2 \sum n_i (z_i - \bar{z})^2 - \Big[\sum n_i (x_i - \bar{x})(z_i - \bar{z})\Big]^2 $$
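As a numerical check, the two ways of writing the variance of the coefficient of x, directly through the cluster coefficients as in (12) and through the expanded sums as in (17), can be compared on made-up data. The sketch assumes the coefficient for cluster i is S_zz (x_i - xbar) - S_xz (z_i - zbar) with cluster-size-weighted sums; the function names, cluster sizes, x and z values and variance components are all hypothetical.

```python
import math


def design_sums(n, x, z):
    """Cluster-size-weighted sums of squares and cross-products of x and z."""
    total = sum(n)
    x_bar = sum(ni * xi for ni, xi in zip(n, x)) / total
    z_bar = sum(ni * zi for ni, zi in zip(n, z)) / total
    dx = [xi - x_bar for xi in x]
    dz = [zi - z_bar for zi in z]
    sums = {
        "xx": sum(ni * a * a for ni, a in zip(n, dx)),
        "zz": sum(ni * b * b for ni, b in zip(n, dz)),
        "xz": sum(ni * a * b for ni, a, b in zip(n, dx, dz)),
        "n2xx": sum(ni ** 2 * a * a for ni, a in zip(n, dx)),
        "n2zz": sum(ni ** 2 * b * b for ni, b in zip(n, dz)),
        "n2xz": sum(ni ** 2 * a * b for ni, a, b in zip(n, dx, dz)),
    }
    return dx, dz, sums


def var_bx_direct(n, x, z, s2_alpha, s2_e):
    """Variance via the per-cluster coefficients C_i, as in equation (12)."""
    dx, dz, s = design_sums(n, x, z)
    delta = s["xx"] * s["zz"] - s["xz"] ** 2
    c = [s["zz"] * a - s["xz"] * b for a, b in zip(dx, dz)]
    sum_n2c2 = sum(ni ** 2 * ci ** 2 for ni, ci in zip(n, c))
    sum_nc2 = sum(ni * ci ** 2 for ni, ci in zip(n, c))
    return (s2_alpha * sum_n2c2 + s2_e * sum_nc2) / delta ** 2


def var_bx_expanded(n, x, z, s2_alpha, s2_e):
    """Variance via the expanded sums, as in equation (17)."""
    _, _, s = design_sums(n, x, z)
    delta = s["xx"] * s["zz"] - s["xz"] ** 2
    sum_n2c2 = (s["zz"] ** 2 * s["n2xx"] + s["xz"] ** 2 * s["n2zz"]
                - 2.0 * s["zz"] * s["xz"] * s["n2xz"])
    return s2_alpha * sum_n2c2 / delta ** 2 + s2_e * s["zz"] / delta


# Hypothetical design: five clusters of unequal size.
n = [20, 35, 50, 28, 60]
x = [62.0, 66.0, 70.0, 74.0, 79.0]
z = [10.0, 40.0, 25.0, 60.0, 55.0]
v12 = var_bx_direct(n, x, z, s2_alpha=0.4, s2_e=3.0)
v17 = var_bx_expanded(n, x, z, s2_alpha=0.4, s2_e=3.0)
print(v12, v17)
```

The expanded form is convenient when only the aggregated sums are available, since it never needs the individual coefficients.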

The variance of $b_x$ can then be estimated by substituting 
sample estimates of $\sigma_\alpha^2$ and $\sigma_e^2$ in (17). As with the simple 
regression (Section 3), $\sigma_e^2$ may be estimated by 

$$ \hat{\sigma}_e^2 = \sum_i \sum_j (y_{ij} - \bar{y}_i)^2/(n - a) $$

Then, noting that the regression sum of squares now has two 
degrees of freedom, $\sigma_\alpha^2$ may be estimated by 

$$ \hat{\sigma}_\alpha^2 = \Big[\sum_i \sum_j (y_{ij} - b_0 - b_x x_i - b_z z_i)^2 - (n - 3)\hat{\sigma}_e^2\Big] \Big/ \lambda(a - 3). $$

(With a multiple regression with $K$ independent variables, 
$\sigma_\alpha^2$ may be estimated by 

$$ \hat{\sigma}_\alpha^2 = \Big[\sum_i \sum_j (y_{ij} - \hat{y}_{ij})^2 - (n - K - 1)\hat{\sigma}_e^2\Big] \Big/ \lambda(a - K - 1) $$

where $\hat{y}_{ij} = b_0 + \sum_k b_k x_{kij}$ are the predicted values from the 
regression.) 

7. Ratio of two regression coefficients 

With aircraft noise surveys one common analysis is to 
run a regression of respondents' annoyance with aircraft 
noise (y) on the levels of noise (x) and numbers of the 
noise events (z) to which they are exposed. The level of 
noise and number of noise events may be combined into a 
noise and number index (NNI). For this purpose the ratio of 
the regression coefficients, $t = b_z/b_x$, is needed. This 
section demonstrates that the optimum choice of $\bar{n}$ (assumed 
constant for all clusters) for estimating $t$ is the same as 
that given in equations (8) and (16). The results in this 
section are derived using two slightly different 
applications of the Taylor's series expansion method for 
obtaining large-sample approximations to the variances of 
complex statistics. 

First Application 

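Using the notation of the previous section, $b_x$ may be 
expressed as $\sum n_i C_i \bar{y}_i/\Delta$, and $b_z$ may be similarly expressed as 
$\sum n_i D_i \bar{y}_i/\Delta$, where $D_i$ has the same form as $C_i$ with $x$ and $z$ 
interchanged. Thus 

$$ t = \sum n_i D_i \bar{y}_i \Big/ \sum n_i C_i \bar{y}_i \tag{18} $$

Treating $t$ as a function of the random variables $\bar{y}_i$, the 
approximate variance of $t$ for large samples may be obtained 
from the Taylor's series expansion method. From this method 
the approximate variance of $t$ is equal to that of its linear 
substitute, $t^*$, where 

$$ t^* = \sum (\partial t/\partial \bar{y}_i)\, \bar{y}_i $$

and $(\partial t/\partial \bar{y}_i)$ is evaluated at $\bar{y}_i = E(\bar{y}_i) = Y_i$, say. 

Now 

$$ V(t^*) = \sum (\partial t/\partial \bar{y}_i)^2 V(\bar{y}_i) = \sum (\partial t/\partial \bar{y}_i)^2 [\sigma_\alpha^2 + (\sigma_e^2/n_i)] $$

Thus, in general, under the model given in equation (11), 

$$ V(t) \doteq \sum (\partial t/\partial \bar{y}_i)^2 [\sigma_\alpha^2 + (\sigma_e^2/n_i)] \tag{19} $$

Assuming a constant subsample size, $n_i = \bar{n}$, 

$$ V(t) \doteq \sum (\partial t/\partial \bar{y}_i)^2\, a(\sigma^2/n)[1 + (\bar{n} - 1)\rho] \tag{20} $$

$$ = K(\sigma^2/n)[1 + (\bar{n} - 1)\rho] \tag{21} $$

where $K = a \sum (\partial t/\partial \bar{y}_i)^2$. This equation is of the same form as 
(6) with $\sigma_x^2$ replaced by $1/K$. Thus, providing $K$ is not a 
function of $a$ or $\bar{n}$, the optimum value of $\bar{n}$ for estimating $t$ 
is the same as that for estimating $b$, i.e. the value given 
by equation (8). 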

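The following derivation demonstrates that $K$ does not 
depend on $a$ or $\bar{n}$. First, from (18) with $n_i = \bar{n}$ it follows 
that 

$$ \partial t/\partial \bar{y}_i = \Big\{\Big(\sum_k C_k Y_k\Big) D_i - \Big(\sum_k D_k Y_k\Big) C_i\Big\} \Big/ \Big(\sum_k C_k Y_k\Big)^2 \tag{22} $$

so that $K$ is 

$$ K = a\Big\{\Big(\sum C_i Y_i\Big)^2 \sum D_i^2 + \Big(\sum D_i Y_i\Big)^2 \sum C_i^2 - 2\Big(\sum C_i Y_i\Big)\Big(\sum D_i Y_i\Big)\Big(\sum C_i D_i\Big)\Big\} \Big/ \Big(\sum C_i Y_i\Big)^4 $$

where 

$$ \sum C_i Y_i = \bar{n} a^2 (\sigma_z^2 \sigma_{Yx} - \sigma_{xz} \sigma_{Yz}), \qquad \sum D_i Y_i = \bar{n} a^2 (\sigma_x^2 \sigma_{Yz} - \sigma_{xz} \sigma_{Yx}), $$

$$ \sum C_i^2 = \bar{n}^2 a^3 \sigma_z^2 (\sigma_x^2 \sigma_z^2 - \sigma_{xz}^2), \qquad \sum D_i^2 = \bar{n}^2 a^3 \sigma_x^2 (\sigma_x^2 \sigma_z^2 - \sigma_{xz}^2), $$

$$ \sum C_i D_i = -\bar{n}^2 a^3 \sigma_{xz} (\sigma_x^2 \sigma_z^2 - \sigma_{xz}^2), $$

$$ \sigma_{Yz} = \sum (Y_i - \bar{Y})(z_i - \bar{z})/a \quad \text{and} \quad \sigma_{Yx} = \sum Y_i (x_i - \bar{x})/a. $$

All the terms in the numerator of $K$ have a common factor of 
$\bar{n}^4 a^8$ and the denominator has this same common factor. On 
cancellation of this factor, $K$ is seen to be a function only 
of $\sigma_x^2$, $\sigma_z^2$, $\sigma_{xz}$, $\sigma_{Yx}$ and $\sigma_{Yz}$; it does not depend on $\bar{n}$ or $a$. 

Given values of $\sigma_x^2$, $\sigma_z^2$ and $\sigma_{xz}$, and estimates of $\sigma_{Yx}$, $\sigma_{Yz}$, 
$\sigma_\alpha^2$ and $\sigma_e^2$, an estimate of $V(t)$ can be obtained by 
substituting these values and estimates in equation (19) 
using $(\partial t/\partial \bar{y}_i)$ from (22). 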

Second Application 

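An alternative approach for obtaining $V(t)$ is to start 
with the Taylor expansion of the ratio $t = b_z/b_x$. Thus 

$$ V(t) = \beta_x^{-2}[V(b_z) + \tau^2 V(b_x) - 2\tau C(b_x, b_z)] \tag{23} $$

where $\tau = \beta_z/\beta_x$ and $C(b_x, b_z)$ is the covariance of $b_x$ and $b_z$. 
From (15) 

$$ V(b_x) = \sigma_z^2 [(\sigma_\alpha^2/a) + (\sigma_e^2/n)]/(\sigma_x^2 \sigma_z^2 - \sigma_{xz}^2) \tag{24} $$

and 

$$ V(b_z) = \sigma_x^2 [(\sigma_\alpha^2/a) + (\sigma_e^2/n)]/(\sigma_x^2 \sigma_z^2 - \sigma_{xz}^2) \tag{25} $$

The covariance term is obtained from 

$$ C(b_x, b_z) = \sum (\partial b_x/\partial \bar{y}_i)(\partial b_z/\partial \bar{y}_i)\, V(\bar{y}_i), $$

with other terms in the summation being zero since 
$C(\bar{y}_i, \bar{y}_j) = 0$ for $i \neq j$. Expressions for $b_x$ and $b_z$ are 

$$ b_x = \sum [\sigma_z^2 (x_i - \bar{x}) - \sigma_{xz}(z_i - \bar{z})]\, \bar{y}_i \Big/ a(\sigma_x^2 \sigma_z^2 - \sigma_{xz}^2) $$

$$ b_z = \sum [\sigma_x^2 (z_i - \bar{z}) - \sigma_{xz}(x_i - \bar{x})]\, \bar{y}_i \Big/ a(\sigma_x^2 \sigma_z^2 - \sigma_{xz}^2) $$

Thus $C(b_x, b_z)$ is 

$$ \frac{[\sigma_\alpha^2 + (\sigma_e^2/\bar{n})] \sum [\sigma_z^2 (x_i - \bar{x}) - \sigma_{xz}(z_i - \bar{z})][\sigma_x^2 (z_i - \bar{z}) - \sigma_{xz}(x_i - \bar{x})]}{a^2 (\sigma_x^2 \sigma_z^2 - \sigma_{xz}^2)^2} $$

$$ = -\sigma_{xz} [(\sigma_\alpha^2/a) + (\sigma_e^2/n)]/(\sigma_x^2 \sigma_z^2 - \sigma_{xz}^2) \tag{26} $$

Substituting (24), (25) and (26) in (23) gives 

$$ V(t) = \frac{[(\sigma_\alpha^2/a) + (\sigma_e^2/n)][\sigma_x^2 + \tau^2 \sigma_z^2 + 2\tau \sigma_{xz}]}{\beta_x^2 (\sigma_x^2 \sigma_z^2 - \sigma_{xz}^2)} $$

or, with $\rho_{xz} = \sigma_{xz}/\sigma_x \sigma_z$, 

$$ V(t) = \frac{[(\sigma_\alpha^2/a) + (\sigma_e^2/n)][(1/\sigma_z^2) + (\tau^2/\sigma_x^2) + (2\tau \rho_{xz}/\sigma_x \sigma_z)]}{\beta_x^2 (1 - \rho_{xz}^2)} \tag{27} $$

A variance estimate $v(t)$ is obtained by substituting sample 
estimates $\hat{\sigma}_\alpha^2$, $\hat{\sigma}_e^2$, $t$ and $b_x$ for the respective unknown 
parameters in (27). 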
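The closed form for the variance of the ratio can be checked against the three component terms it is built from. The sketch below is illustrative: it assumes equal subsample sizes, treats the variances and covariance of x and z across clusters as known, and uses hypothetical function names and parameter values throughout.

```python
import math


def var_ratio_closed_form(beta_x, tau, s2_alpha, s2_e, a, n, var_x, var_z, cov_xz):
    """Closed form in terms of the correlation between x and z, as in equation (27)."""
    common = s2_alpha / a + s2_e / n
    sd_x, sd_z = math.sqrt(var_x), math.sqrt(var_z)
    rho_xz = cov_xz / (sd_x * sd_z)
    bracket = 1.0 / var_z + tau ** 2 / var_x + 2.0 * tau * rho_xz / (sd_x * sd_z)
    return common * bracket / (beta_x ** 2 * (1.0 - rho_xz ** 2))


def var_ratio_from_parts(beta_x, tau, s2_alpha, s2_e, a, n, var_x, var_z, cov_xz):
    """Same quantity assembled from the variances and covariance of the two slopes."""
    common = s2_alpha / a + s2_e / n
    det = var_x * var_z - cov_xz ** 2
    v_bx = var_z * common / det          # variance of the x coefficient
    v_bz = var_x * common / det          # variance of the z coefficient
    c_bxbz = -cov_xz * common / det      # covariance of the two coefficients
    return (v_bz + tau ** 2 * v_bx - 2.0 * tau * c_bxbz) / beta_x ** 2


# Hypothetical parameter values, for illustration only.
params = dict(beta_x=0.08, tau=0.5, s2_alpha=0.4, s2_e=3.0,
              a=50, n=2500, var_x=16.0, var_z=400.0, cov_xz=30.0)
v_closed = var_ratio_closed_form(**params)
v_parts = var_ratio_from_parts(**params)

# Coefficient of variation of the denominator, to judge the Taylor approximation.
det = params["var_x"] * params["var_z"] - params["cov_xz"] ** 2
v_bx = params["var_z"] * (params["s2_alpha"] / params["a"]
                          + params["s2_e"] / params["n"]) / det
cv_bx = math.sqrt(v_bx) / params["beta_x"]
print(v_closed, v_parts, cv_bx)
```

Computing the coefficient of variation of the denominator alongside the variance is a cheap safeguard, since the approximation is only trustworthy when that coefficient is small.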

The accuracy of the approximate variance of the ratio 
$t = b_z/b_x$ obtained by the Taylor expansion method depends on 
the coefficient of variation of the denominator of the 
ratio, i.e. $CV(b_x) = \sqrt{V(b_x)}/\beta_x$. A $CV(b_x)$ of less than 0.2 
and preferably less than 0.1 is required if the Taylor 
expansion method is to produce a satisfactory approximation 
of $V(t)$. (It should be noted that a low $CV(b_x)$ also ensures 
that the bias of $t$ is negligible.) A check should be made 
that the estimated coefficient of variation 
$cv(b_x) = \sqrt{v(b_x)}/b_x$ is less than 0.2; if this condition is 
not satisfied, the Taylor expansion variance estimate should 
not be used. In any case, if this condition is not 
satisfied, the utility of the index $t$ should be critically 
examined. 

For the first application of the Taylor expansion 
method, the equivalent condition is that the coefficient of 
variation of the denominator, i.e. $CV(\sum n_i C_i \bar{y}_i)$, should be 
small, less than 0.2 and preferably less than 0.1. 

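8. The case of variable x in clusters 

The previous sections have assumed that $x_i$ is a 
constant value within a cluster. We now consider the case 
where $x$ takes different values within a cluster, individual 
$j$ in cluster $i$ having a value $x_{ij}$. The regression 
coefficient $\beta$ is assumed to be the same within each cluster. 
In this case the treatment of the simple regression 
discussed in section 3 is modified as follows. 

The simple regression coefficient is now given by 

$$ b = \sum_i \sum_j (x_{ij} - \bar{x})\, y_{ij} \Big/ \sum_i \sum_j (x_{ij} - \bar{x})^2 $$

$$ = \sum_i \sum_j C_{ij} y_{ij} \Big/ \sum_i \sum_j C_{ij}^2 \tag{28} $$

with $C_{ij} = (x_{ij} - \bar{x})$. The variance of $b$ is then 

$$ V(b) = \Big\{\sum_i \sum_j C_{ij}^2 V(y_{ij}) + \sum_i \sum_j \sum_{k \neq j} C_{ij} C_{ik}\, C(y_{ij}, y_{ik})\Big\} \Big/ \Big(\sum_i \sum_j C_{ij}^2\Big)^2 $$

where $C(y_{ij}, y_{ik})$ is the covariance of $y_{ij}$ and $y_{ik}$. 

Now $V(y_{ij}) = \sigma_\alpha^2 + \sigma_e^2$ 

and 

$$ C(y_{ij}, y_{ik}) = E[y_{ij} - E(y_{ij})][y_{ik} - E(y_{ik})] = E[(\alpha_i + e_{ij})(\alpha_i + e_{ik})] = \sigma_\alpha^2 $$

Thus 

$$ V(b) = \Big\{\sigma_\alpha^2 \Big[\sum_i \sum_j C_{ij}^2 + \sum_i \sum_j \sum_{k \neq j} C_{ij} C_{ik}\Big] + \sigma_e^2 \Big(\sum_i \sum_j C_{ij}^2\Big)\Big\} \Big/ \Big(\sum_i \sum_j C_{ij}^2\Big)^2 $$

$$ = \Big\{\sigma_\alpha^2 \Big[\sum_i \Big(\sum_j C_{ij}\Big)^2\Big] + \sigma_e^2 \Big(\sum_i \sum_j C_{ij}^2\Big)\Big\} \Big/ \Big(\sum_i \sum_j C_{ij}^2\Big)^2 $$

$$ = \Big\{\sigma_\alpha^2 \sum_i n_i^2 (\bar{x}_i - \bar{x})^2 \Big/ \Big[\sum_i \sum_j (x_{ij} - \bar{x})^2\Big]^2\Big\} + \Big\{\sigma_e^2 \Big/ \sum_i \sum_j (x_{ij} - \bar{x})^2\Big\} \tag{29} $$

This formula is the generalization of (3); substituting 
$x_{ij} = \bar{x}_i = x_i$ in (29) yields (3). 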

In order to examine the optimum subsample size, 
consider the case with $n_i = \bar{n}$. Denoting the proportion of 
the variance in $x$ explained by the clusters as 

$$ \eta^2 = \bar{n} \sum_i (\bar{x}_i - \bar{x})^2 \Big/ \sum_i \sum_j (x_{ij} - \bar{x})^2, \tag{30} $$

the variance of $b$ is given by 

$$ V(b) = [(\sigma_\alpha^2 \eta^2/a) + (\sigma_e^2/n)]/\sigma_x^2 \tag{31} $$

or 

$$ V(b) = (\sigma^2/n)[1 + (\bar{n}\eta^2 - 1)\rho]/\sigma_x^2 \tag{32} $$

These formulae are the same as equations (5) and (6) except 
that $\sigma_\alpha^2$ is replaced by $\sigma_\alpha^2 \eta^2$ in (5) and $\bar{n}$ is replaced by $\bar{n}\eta^2$ 
in (6). Thus by redefining $x_1$ in Section 4 to be $\sigma_\alpha \eta/\sqrt{a}$, the 
optimum value of $\bar{n}$ is obtained directly as 

$$ \bar{n}_{opt} = (\sigma_e/\eta\sigma_\alpha)(C_a/c)^{1/2} \tag{33} $$

or equivalently as 

$$ \bar{n}_{opt} = [(1 - \rho)/\rho]^{1/2}[C_a/c]^{1/2}(1/\eta) \tag{34} $$

Note that if $\eta = 1$, i.e. $x_{ij} = \bar{x}_i$ for all $j$, then $\bar{n}_{opt}$ 
reduces to that obtained in Section 4. If $\eta = 0$, i.e. the 
cluster means for $x$ are all the same so that the variability 
in $x$ is all within the clusters, $\bar{n}_{opt} = \infty$, with only one 
cluster being sampled. 
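The effect of within-cluster variation in x on the optimum subsample size can be seen by evaluating the formula for a few values of the between-cluster share. A minimal sketch, in which `eta` denotes the square root of the proportion of the variance of x explained by the clusters, and the function name and input values are hypothetical:

```python
import math


def optimum_subsample_size_eta(rho, cost_ratio, eta):
    """Optimum subsample size when x varies within clusters.

    eta = 1 recovers the case of x constant within clusters; eta = 0 means all
    of the variation in x is within clusters, so the optimum is unbounded.
    """
    if eta == 0:
        return math.inf
    return math.sqrt((1.0 - rho) / rho) * math.sqrt(cost_ratio) / eta


base = optimum_subsample_size_eta(0.1069, 20, eta=1.0)
half = optimum_subsample_size_eta(0.1069, 20, eta=0.5)
none = optimum_subsample_size_eta(0.1069, 20, eta=0.0)
print(base, half, none)
```

The optimum is inversely proportional to eta, so halving the between-cluster share of the standard deviation of x doubles the number of individuals worth taking per cluster.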

9. A three-stage design 

In this section we consider a three-stage sample 
design. At the first stage $a$ primary sampling units (PSU's) 
are selected; next $n_i$ second stage units (SSU's) are 
selected within PSU $i$; and finally $n_{ij}$ elements are selected 
in second stage unit $j$ in PSU $i$. The regression model with 
a single independent x-variable extends to 

$$ y_{ijk} = \beta_0 + \beta x_{ij} + \alpha_i + \delta_{ij} + e_{ijk} $$

where $\alpha_i$ is the cluster effect of PSU $i$ and $\delta_{ij}$ is the 
cluster effect of SSU $ij$. The $\alpha_i$ and $\delta_{ij}$ are random effects 
with $E(\alpha_i) = E(\delta_{ij}) = 0$, $E(\alpha_i^2) = \sigma_\alpha^2$, $E(\delta_{ij}^2) = \sigma_\delta^2$, and 
$E(\alpha_i \mid x) = E(\delta_{ij} \mid x) = 0$. The x-variable is assumed to be 
constant within a SSU, and the regression coefficient is 
assumed to be the same within each PSU. 

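The simple regression coefficient may be expressed as 

$$ b = \frac{\sum_i \sum_j \sum_k (x_{ij} - \bar{x})\, y_{ijk}}{\sum_i \sum_j \sum_k (x_{ij} - \bar{x})^2} = \frac{\sum_i \sum_j n_{ij} (x_{ij} - \bar{x})\, \bar{y}_{ij}}{\sum_i \sum_j n_{ij} (x_{ij} - \bar{x})^2} $$

$$ = \sum_i \sum_j n_{ij} C_{ij} \bar{y}_{ij} \Big/ \sum_i \sum_j n_{ij} C_{ij}^2 \tag{36} $$

where $C_{ij} = (x_{ij} - \bar{x})$. Then 

$$ V(b) = \frac{\sum_i \sum_j n_{ij}^2 C_{ij}^2 V(\bar{y}_{ij}) + \sum_i \sum_j \sum_{k \neq j} n_{ij} C_{ij} n_{ik} C_{ik}\, C(\bar{y}_{ij}, \bar{y}_{ik})}{\Big(\sum_i \sum_j n_{ij} C_{ij}^2\Big)^2} \tag{37} $$

Now $\bar{y}_{ij} = \beta_0 + \beta x_{ij} + \alpha_i + \delta_{ij} + \bar{e}_{ij}$, so 

$$ V(\bar{y}_{ij}) = \sigma_\alpha^2 + \sigma_\delta^2 + (\sigma_e^2/n_{ij}) \tag{38} $$

and 

$$ \mathrm{Cov}(\bar{y}_{ij}, \bar{y}_{ik}) = E(\alpha_i + \delta_{ij} + \bar{e}_{ij})(\alpha_i + \delta_{ik} + \bar{e}_{ik}) = \sigma_\alpha^2 \text{ for } j \neq k. \tag{39} $$

Thus 

$$ V(b) = \frac{\sigma_\alpha^2 \sum_i \Big(\sum_j n_{ij} C_{ij}\Big)^2}{\Big(\sum_i \sum_j n_{ij} C_{ij}^2\Big)^2} + \frac{\sigma_\delta^2 \sum_i \sum_j n_{ij}^2 C_{ij}^2}{\Big(\sum_i \sum_j n_{ij} C_{ij}^2\Big)^2} + \frac{\sigma_e^2}{\sum_i \sum_j n_{ij} C_{ij}^2} \tag{40} $$

Consider now the case where the same number of SSU's is 
taken from each PSU, $n_i = d$, and the same number of elements 
is taken from each SSU, $n_{ij} = \bar{n}$. The $V(b)$ in (40) reduces 
to 

$$ V(b) = \frac{\sigma_\alpha^2 \sum_i \Big(\sum_j C_{ij}\Big)^2}{\Big(\sum_i \sum_j C_{ij}^2\Big)^2} + \frac{\sigma_\delta^2}{\sum_i \sum_j C_{ij}^2} + \frac{\sigma_e^2}{\bar{n} \sum_i \sum_j C_{ij}^2} \tag{41} $$

Denoting the variance of $x$ as 

$$ \sigma_x^2 = \sum_i \sum_j (x_{ij} - \bar{x})^2/ad \tag{42} $$

and the proportion of $\sigma_x^2$ explained by the PSU's as 

$$ \eta^2 = \frac{d \sum_i (\bar{x}_i - \bar{x})^2}{\sum_i \sum_j (x_{ij} - \bar{x})^2} = \frac{\sum_i \Big(\sum_j C_{ij}\Big)^2}{d \sum_i \sum_j C_{ij}^2} \tag{43} $$

$V(b)$ may be expressed as 

$$ V(b) = \Big[\frac{\sigma_\alpha^2 \eta^2}{a} + \frac{\sigma_\delta^2}{ad} + \frac{\sigma_e^2}{ad\bar{n}}\Big] \Big/ \sigma_x^2 \tag{44} $$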


To determine the optimum values of $\bar{n}$ and $d$, consider 
the simple cost model $C = ac_a + adc_\delta + \bar{n}adc$, where $c_a$ is the 
cost of including a PSU, $c_\delta$ is the cost of including a SSU 
and $c$ the cost of including an element in the sample. 

For given $\sigma_x^2$, the optimum choice of $\bar{n}$ and $d$ that 
minimize $V(b)$ for fixed total cost $C$ can be obtained from 
the Cauchy-Schwarz inequality as follows. Write 
$V(b) = \sum V_i/n_i = \sum u_i^2$, where $V_1 = \sigma_\alpha^2 \eta^2/\sigma_x^2$, $V_2 = \sigma_\delta^2/\sigma_x^2$, 
$V_3 = \sigma_e^2/\sigma_x^2$, $n_1 = a$, $n_2 = ad$ and $n_3 = ad\bar{n}$, and write 
$C = \sum c_i n_i = \sum w_i^2$, where $c_1 = c_a$, $c_2 = c_\delta$ and $c_3 = c$. Then 
the product $VC$ is minimized when $u_i/w_i$ = constant. This 
condition requires first that $(u_2/w_2) = (u_3/w_3)$, i.e. that 

$$ \frac{\sigma_\delta/(\sigma_x \sqrt{ad})}{\sqrt{c_\delta}\sqrt{ad}} = \frac{\sigma_e/(\sigma_x \sqrt{ad\bar{n}})}{\sqrt{c}\sqrt{ad\bar{n}}} $$

so that 

$$ \bar{n}_{opt} = (c_\delta/c)^{1/2}(\sigma_e/\sigma_\delta). \tag{45} $$

The condition also requires that $(u_1/w_1) = (u_2/w_2)$, i.e. 
that 

$$ \frac{\sigma_\alpha \eta/(\sigma_x \sqrt{a})}{\sqrt{c_a}\sqrt{a}} = \frac{\sigma_\delta/(\sigma_x \sqrt{ad})}{\sqrt{c_\delta}\sqrt{ad}} $$

so that 

$$ d_{opt} = (c_a/c_\delta)^{1/2}(\sigma_\delta/\eta\sigma_\alpha) \tag{46} $$

The optimum values of $\bar{n}_{opt}$ and $d_{opt}$ given by (45) and (46) 
may then be combined with the constraint of the total cost C 
to determine the value of a. 
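The three-stage allocation can be sketched numerically. The code below is an illustration under stated assumptions, not a definitive implementation: it assumes a variance of the form (PSU term)/a + (SSU term)/(a d) + (element term)/(a d n) divided by the variance of x, with the PSU term scaled by the squared between-PSU share `eta`, and a cost of the form a c_a + a d c_delta + a d n c. All function names, parameter names and values are hypothetical, and the resulting sizes are treated as continuous rather than rounded to integers.

```python
import math


def three_stage_optimum(s2_alpha, s2_delta, s2_e, eta, c_a, c_delta, c, total_cost):
    """Return (a, d_opt, n_opt) for the assumed variance and cost models."""
    n_opt = math.sqrt(c_delta / c) * math.sqrt(s2_e / s2_delta)
    d_opt = math.sqrt(c_a / c_delta) * math.sqrt(s2_delta / s2_alpha) / eta
    # The number of PSUs follows from spending the whole budget.
    a = total_cost / (c_a + d_opt * c_delta + n_opt * d_opt * c)
    return a, d_opt, n_opt


def var_b_three_stage(a, d, n_bar, s2_alpha, s2_delta, s2_e, eta, var_x):
    """Variance of b with equal numbers of SSUs per PSU and elements per SSU."""
    return (s2_alpha * eta ** 2 / a + s2_delta / (a * d)
            + s2_e / (a * d * n_bar)) / var_x


def var_at_budget(d, n_bar, comps, costs, total_cost, var_x):
    """Variance for a given (d, n_bar) when a is set by the same total cost."""
    a = total_cost / (costs["c_a"] + d * costs["c_delta"] + n_bar * d * costs["c"])
    return var_b_three_stage(a, d, n_bar, var_x=var_x, **comps)


comps = dict(s2_alpha=0.4, s2_delta=0.2, s2_e=3.0, eta=0.8)
costs = dict(c_a=100.0, c_delta=20.0, c=1.0)
budget, var_x = 10000.0, 16.0

a_opt, d_opt, n_opt = three_stage_optimum(total_cost=budget, **comps, **costs)
v_opt = var_at_budget(d_opt, n_opt, comps, costs, budget, var_x)
v_alternatives = [
    var_at_budget(d_opt * fd, n_opt * fn, comps, costs, budget, var_x)
    for fd in (0.5, 1.0, 2.0) for fn in (0.5, 1.0, 2.0)
]
print(a_opt, d_opt, n_opt, v_opt)
```

Comparing the variance at the computed optimum with the variance at neighbouring allocations of the same budget is a simple way to confirm that the closed-form values do minimize the variance under the assumed models.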